You want to analyze: summarize, visualize, model, etc.
Can be downloaded to your local machine.
Can be read fully into R…
… but maybe very slowly…
… and maybe only if it’s a parquet file.
It’s probably in a cloud database (e.g., AWS).
Do the individual files/tables fit on disk?
Can you query subsets of the data and fit those on disk?
Data is on disk if it is stored on your computer
Data is in-memory if you load it into RAM, e.g. loading into R.
A csv file is a file type for storing data as comma-separated text.
A parquet file is a file type for storing data column-by-column, along with metadata about each column.
data.table optimizes calculations in R on data frames, via algorithmic cleverness and C implementation.
duckdb creates a SQL database locally and lets you use R code to execute SQL operations.
arrow provides ways to read and write parquet files and to move data around between data.table, duckdb, and other formats.
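A minimal sketch of how the three pieces fit together (file names are made up for illustration):

```r
library(data.table)  # fast in-memory data frames
library(arrow)       # parquet read/write
library(duckdb)      # local SQL engine (attaches DBI)

# data.table: fast csv reading/writing and in-memory computation
dt <- data.table(id = 1:3, x = c(10, 20, 30))
fwrite(dt, "demo.csv")
dt2 <- fread("demo.csv")

# arrow: store the same data as a columnar parquet file
write_parquet(dt2, "demo.parquet")

# duckdb: run SQL against the parquet file without loading it all into R
con <- dbConnect(duckdb())
res <- dbGetQuery(con, "SELECT id, x FROM 'demo.parquet' WHERE x > 15")
dbDisconnect(con, shutdown = TRUE)
```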
data.table:
dtplyr, tidyfast for dplyr syntax
mlr3 for machine learning
duckdb:
duckplyr for dplyr syntax
odbc for connecting to cloud databases
polars in Python
A. Use data.table::fread() instead.
B. Write it to a parquet version; use arrow::read_parquet().
C. Put it in a duckdb; use queries to avoid reading the whole dataset at once.
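A hedged sketch of options B and C (the csv here is a tiny stand-in, and all file names are made up):

```r
library(arrow)
library(duckdb)

# tiny stand-in for a csv that is too big to read comfortably
write.csv(data.frame(ST = c("04", "04", "06"), PWGTP = c(35, 40, 28)),
          "big_file.csv", row.names = FALSE)

# A. data.table::fread("big_file.csv") is a drop-in, much faster read.csv

# B. convert once to parquet; later reads are fast and can grab just
#    the columns you need
write_parquet(read_csv_arrow("big_file.csv"), "big_file.parquet")
dat <- read_parquet("big_file.parquet", col_select = c("ST", "PWGTP"))

# C. or query subsets straight off the file with duckdb, so the whole
#    dataset never has to be in memory at once
con <- dbConnect(duckdb())
sub <- dbGetQuery(con,
  "SELECT ST, PWGTP FROM 'big_file.parquet' WHERE ST = '04'")
dbDisconnect(con, shutdown = TRUE)
```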
A. Are you using vectorized functions (or could you)?
B. Use data.table - do the small speed gains add up?
C. Can you move some subsetting steps to duckdb?
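On point A, a base-R illustration of why vectorization matters: the loop does the same work element by element in interpreted R, while the vectorized call loops in compiled code.

```r
x <- runif(1e6)

# element-by-element loop, all in interpreted R
slow_log <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- log(x[i])
  out
}

# vectorized: one call, the loop happens in C
fast_log <- function(x) log(x)

# same answer; timing the two (e.g. with tictoc) shows the gap
slow <- slow_log(x)
fast <- fast_log(x)
```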
A. Can you re-order the pipeline?
B. Are you doing a split-apply-combine over many groups? data.table!
C. Are you doing a subsetting process? data.table or duckdb!
D. Is it just a lot of data? duckdb and calculate in partitions.
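For the split-apply-combine case in B, a minimal data.table sketch of a max-weight-per-group step (toy data, hypothetical column names):

```r
library(data.table)

# toy stand-in: one weight per row, grouped by state
dt <- data.table(ST = c("04", "04", "06"),
                 Weight_Amount = c(35, 50, 28))

# split-apply-combine in one call: for each ST, take the max weight;
# data.table optimizes grouped max() internally
maxw <- dt[, .(max_weight = max(Weight_Amount)), by = ST]
```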
tictoc package
tic()
dat |>
pivot_longer(PWGTP1:PWGTP80,
names_to = "Weight_Num",
values_to = "Weight_Amount") |>
group_by(ST) |>
mutate(
max_weight = max(Weight_Amount)
  )
toc()
# A tibble: 5,842,000 × 210
# Groups: ST [1]
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP CIT
<chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
2 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
3 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
4 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
5 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
6 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
7 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
8 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
9 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
10 P 2021GQ000… 8 01 00800 4 04 1.03e6 35 36 1
# ℹ 5,841,990 more rows
# ℹ 199 more variables: CITWP <dbl>, COW <dbl>, DDRS <dbl>, DEAR <dbl>,
# DEYE <dbl>, DOUT <dbl>, DPHY <dbl>, DRAT <dbl>, DRATX <dbl>, DREM <dbl>,
# ENG <dbl>, FER <dbl>, GCL <dbl>, GCM <dbl>, GCR <dbl>, HIMRKS <dbl>,
# HINS1 <dbl>, HINS2 <dbl>, HINS3 <dbl>, HINS4 <dbl>, HINS5 <dbl>,
# HINS6 <dbl>, HINS7 <dbl>, INTP <dbl>, JWMNP <dbl>, JWRIP <dbl>,
# JWTRNS <chr>, LANX <dbl>, MAR <dbl>, MARHD <dbl>, MARHM <dbl>, …
3.124 sec elapsed
old_pipeline <- function() {
dat |>
pivot_longer(PWGTP1:PWGTP80,
names_to = "Weight_Num",
values_to = "Weight_Amount") |>
group_by(CIT) |>
mutate(
max_weight = max(Weight_Amount)
)
}
library(dtplyr)
new_pipeline <- function() {
dat |>
rowwise() |>
summarize(
max_weight = max(PWGTP1:PWGTP80),
CIT = CIT
) |>
group_by(CIT) |>
summarize(
max_weight = max(max_weight)
)
}
microbenchmark::microbenchmark(
old_version = old_pipeline(),
new_version = new_pipeline(),
times = 5
)
Unit: milliseconds
expr min lq mean median uq max neval
old_version 1612.0746 1673.3494 1820.7446 1823.9169 1984.6106 2009.7713 5
new_version 191.5454 250.3799 329.5858 387.5821 408.1263 410.2953 5
bench package for memory comparisons…
bench::mark(
old_version = old_pipeline(),
new_version = new_pipeline(),
check = FALSE,
max_iterations = 3
)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 old_version 2.09s 2.09s 0.478 9.45GB 0.956
2 new_version 191.34ms 199.4ms 4.51 15.24MB 28.6
results <- bench::press(
duplications = c(1, 2),
{
    # note: old_pipeline() and new_pipeline() read the global `dat`, so to
    # make the benchmark scale, redefine them to take dat_big as an argument
    dat_big <- bind_rows(replicate(duplications, dat, simplify = FALSE))
bench::mark(
old_version = old_pipeline(),
new_version = new_pipeline(),
check = FALSE,
max_iterations = 3
)
}
)
results
# A tibble: 4 × 7
expression duplications min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 old_version 1 2.04s 2.04s 0.490 9.45GB 0.490
2 new_version 1 194.53ms 251.4ms 3.47 15.24MB 20.8
3 old_version 2 1.96s 1.96s 0.509 9.45GB 1.02
4 new_version 2 248.67ms 262.87ms 3.80 15.24MB 20.9
Use the tictoc package or proc.time() to get a feel for runtimes of larger chunks.
Use profiling to narrow down where in a chunk the slowdowns are.
Use benchmarking to compare your old version to a proposed solution.
Use benchmark testing to see how your speed ups scale with data size.
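For the profiling step, a base-R sketch with Rprof() (the profvis package wraps the same mechanism in an interactive view):

```r
# Rprof() samples the call stack while the code below runs
prof_file <- tempfile()
Rprof(prof_file)
invisible(lapply(1:20, function(i) sort(runif(1e6))))
Rprof(NULL)

# summaryRprof() reports which functions the time went to
prof <- summaryRprof(prof_file)
head(prof$by.total)
```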